The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
summary(data_02)
Date Source Site ID POC
Length:15976 Length:15976 Min. :60010007 Min. :1.000
Class :character Class :character 1st Qu.:60290014 1st Qu.:1.000
Mode :character Mode :character Median :60590007 Median :1.000
Mean :60549600 Mean :1.581
3rd Qu.:60731002 3rd Qu.:1.000
Max. :61131003 Max. :6.000
Daily Mean PM2.5 Concentration UNITS DAILY_AQI_VALUE
Min. : 0.00 Length:15976 Min. : 0.00
1st Qu.: 7.00 Class :character 1st Qu.: 29.00
Median : 12.00 Mode :character Median : 50.00
Mean : 16.12 Mean : 53.68
3rd Qu.: 20.50 3rd Qu.: 69.00
Max. :104.30 Max. :176.00
Site Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
Length:15976 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88215
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88502
Max. :1 Max. :100 Max. :88502
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
Length:15976 Min. :12540 Length:15976 Min. :6
Class :character 1st Qu.:23420 Class :character 1st Qu.:6
Mode :character Median :40140 Mode :character Median :6
Mean :33270 Mean :6
3rd Qu.:41740 3rd Qu.:6
Max. :49700 Max. :6
NA's :929
STATE COUNTY_CODE COUNTY SITE_LATITUDE
Length:15976 Min. : 1.00 Length:15976 Min. :32.63
Class :character 1st Qu.: 29.00 Class :character 1st Qu.:34.07
Mode :character Median : 59.00 Mode :character Median :35.36
Mean : 54.78 Mean :36.00
3rd Qu.: 73.00 3rd Qu.:37.77
Max. :113.00 Max. :41.71
SITE_LONGITUDE
Min. :-124.2
1st Qu.:-121.4
Median :-119.1
Mean :-119.4
3rd Qu.:-117.9
Max. :-115.5
summary(data_22)
Date Source Site ID POC
Length:56140 Length:56140 Min. :60010007 Min. : 1.000
Class :character Class :character 1st Qu.:60310004 1st Qu.: 1.000
Mode :character Mode :character Median :60631006 Median : 3.000
Mean :60567850 Mean : 2.549
3rd Qu.:60750005 3rd Qu.: 3.000
Max. :61131003 Max. :21.000
Daily Mean PM2.5 Concentration UNITS DAILY_AQI_VALUE
Min. : -2.20 Length:56140 Min. : 0.00
1st Qu.: 4.20 Class :character 1st Qu.: 18.00
Median : 6.90 Mode :character Median : 29.00
Mean : 8.52 Mean : 32.84
3rd Qu.: 10.80 3rd Qu.: 45.00
Max. :302.50 Max. :353.00
Site Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
Length:56140 Min. :1 Min. :100 Min. :88101
Class :character 1st Qu.:1 1st Qu.:100 1st Qu.:88101
Mode :character Median :1 Median :100 Median :88101
Mean :1 Mean :100 Mean :88197
3rd Qu.:1 3rd Qu.:100 3rd Qu.:88101
Max. :1 Max. :100 Max. :88502
AQS_PARAMETER_DESC CBSA_CODE CBSA_NAME STATE_CODE
Length:56140 Min. :12540 Length:56140 Min. :6
Class :character 1st Qu.:31080 Class :character 1st Qu.:6
Mode :character Median :40140 Mode :character Median :6
Mean :35340 Mean :6
3rd Qu.:41860 3rd Qu.:6
Max. :49700 Max. :6
NA's :4199
STATE COUNTY_CODE COUNTY SITE_LATITUDE
Length:56140 Min. : 1.00 Length:56140 Min. :32.58
Class :character 1st Qu.: 31.00 Class :character 1st Qu.:34.14
Mode :character Median : 63.00 Mode :character Median :36.50
Mean : 56.64 Mean :36.33
3rd Qu.: 75.00 3rd Qu.:37.97
Max. :113.00 Max. :41.76
SITE_LONGITUDE
Min. :-124.2
1st Qu.:-121.5
Median :-119.7
Mean :-119.7
3rd Qu.:-118.1
Max. :-115.5
sum(is.na(data_02))
[1] 929
sum(is.na(data_22))
[1] 4199
The 2002 data has 15976 rows (observations) and 20 columns (variables). The 2022 data has 56140 rows (observations) and 20 columns (variables). There were 929 NAs in the 2002 data and 4199 NAs in the 2022 data, all of which were in the CBSA_CODE column.
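The counts above can be checked per column rather than only in aggregate. The sketch below uses a small made-up data frame (`toy` and its values are illustrative, not the EPA data) to show how `dim()` and `colSums(is.na())` reveal which column holds the NAs.

```r
# Toy stand-in for data_02 / data_22; all values are invented for illustration.
toy <- data.frame(
  PM2.5     = c(25.1, 31.6, 21.4, 25.9),
  CBSA_CODE = c(41860, NA, 41860, NA),
  COUNTY    = c("Alameda", "Alameda", "Butte", "Butte")
)

# Number of rows (observations) and columns (variables).
toy_dim <- dim(toy)

# Missing values counted separately for each column.
na_by_column <- colSums(is.na(toy))

# Names of the columns that contain at least one NA.
cols_with_na <- names(na_by_column)[na_by_column > 0]

print(toy_dim)
print(na_by_column)
print(cols_with_na)
```

Applied to the real tables, the same `colSums(is.na(...))` call would show whether the total from `sum(is.na(...))` comes from a single column.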
When checking our data, we see that daily concentrations of PM2.5 in the 2022 data have a minimum of -2.2. It doesn’t make sense for the concentration to be less than zero, so we remove the values below zero.
data_22 <- data_22[data_22$`Daily Mean PM2.5 Concentration` >= 0, ]
summary(data_22$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 4.200 7.000 8.554 10.800 302.500
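As a minimal sketch of the filtering step, the example below counts the negative readings before dropping them, so the number of removed rows is known. The data frame `toy` is invented for illustration. Wrapping the condition in `which()` is an optional safeguard that is not part of the code above: it skips rows where the comparison is NA instead of returning all-NA rows.

```r
# Toy PM2.5 readings, including two implausible negative values (invented data).
toy <- data.frame(
  site  = c("A", "A", "B", "B", "C"),
  PM2.5 = c(4.2, -2.2, 7.0, -0.5, 10.8)
)

# Count the negative readings before dropping them.
n_negative <- sum(toy$PM2.5 < 0)

# Keep rows with non-negative concentrations.
# which() returns only the positions where the condition is TRUE.
toy_clean <- toy[which(toy$PM2.5 >= 0), ]

print(n_negative)
print(summary(toy_clean$PM2.5))
```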
Step 2: Combine Data
combined_data <- rbindlist(list(data_02[, year := 2002], data_22[, year := 2022]))
setnames(combined_data, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
setnames(data_02, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
setnames(data_22, c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"), c("PM2.5", "lat", "lon"))
head(combined_data)
Date Source Site ID POC PM2.5 UNITS DAILY_AQI_VALUE Site Name
1: 01/05/2002 AQS 60010007 1 25.1 ug/m3 LC 78 Livermore
2: 01/06/2002 AQS 60010007 1 31.6 ug/m3 LC 92 Livermore
3: 01/08/2002 AQS 60010007 1 21.4 ug/m3 LC 71 Livermore
4: 01/11/2002 AQS 60010007 1 25.9 ug/m3 LC 80 Livermore
5: 01/14/2002 AQS 60010007 1 34.5 ug/m3 LC 98 Livermore
6: 01/17/2002 AQS 60010007 1 41.0 ug/m3 LC 115 Livermore
DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE AQS_PARAMETER_DESC
1: 1 100 88101 PM2.5 - Local Conditions
2: 1 100 88101 PM2.5 - Local Conditions
3: 1 100 88101 PM2.5 - Local Conditions
4: 1 100 88101 PM2.5 - Local Conditions
5: 1 100 88101 PM2.5 - Local Conditions
6: 1 100 88101 PM2.5 - Local Conditions
CBSA_CODE CBSA_NAME STATE_CODE STATE
1: 41860 San Francisco-Oakland-Hayward, CA 6 California
2: 41860 San Francisco-Oakland-Hayward, CA 6 California
3: 41860 San Francisco-Oakland-Hayward, CA 6 California
4: 41860 San Francisco-Oakland-Hayward, CA 6 California
5: 41860 San Francisco-Oakland-Hayward, CA 6 California
6: 41860 San Francisco-Oakland-Hayward, CA 6 California
COUNTY_CODE COUNTY lat lon year
1: 1 Alameda 37.68753 -121.7842 2002
2: 1 Alameda 37.68753 -121.7842 2002
3: 1 Alameda 37.68753 -121.7842 2002
4: 1 Alameda 37.68753 -121.7842 2002
5: 1 Alameda 37.68753 -121.7842 2002
6: 1 Alameda 37.68753 -121.7842 2002
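The combine-and-rename step can also be sketched in base R, which may help when reading the data.table calls above. This is an illustrative equivalent under assumed toy inputs (`toy_02`, `toy_22`, and their values are invented), not the code that produced the output shown.

```r
# Invented stand-ins for the two yearly tables.
toy_02 <- data.frame(`Daily Mean PM2.5 Concentration` = c(25.1, 31.6),
                     SITE_LATITUDE = c(37.69, 37.69),
                     check.names = FALSE)
toy_22 <- data.frame(`Daily Mean PM2.5 Concentration` = c(6.9, 4.2, 10.8),
                     SITE_LATITUDE = c(34.14, 34.14, 36.50),
                     check.names = FALSE)

# Tag each table with its year, then stack the rows.
toy_02$year <- 2002
toy_22$year <- 2022
toy_combined <- rbind(toy_02, toy_22)

# Rename the long column names to short ones.
old_names <- c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE")
new_names <- c("PM2.5", "lat")
names(toy_combined)[match(old_names, names(toy_combined))] <- new_names

print(table(toy_combined$year))
print(names(toy_combined))
```

Tabulating `year` after stacking is a quick way to confirm that each source table contributed the expected number of rows.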
Overall, there appear to be many more sites in 2022 than in 2002. Sites are spread across most of California, with the exception of the southeast, and many are concentrated around bigger cities such as Los Angeles.
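The visual impression of more sites in 2022 can be backed by a count of distinct sites per year. The sketch below assumes a site identifier column and a year column. The `toy` data frame and its `site_id` values are invented for illustration.

```r
# Invented site/year records; several rows per site mimic daily observations.
toy <- data.frame(
  site_id = c(1, 1, 2, 1, 1, 2, 3, 3, 4),
  year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022, 2022, 2022)
)

# Number of distinct monitoring sites in each year.
sites_per_year <- tapply(toy$site_id, toy$year, function(x) length(unique(x)))

print(sites_per_year)
```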
Step 4:
sum(is.na(combined_data$PM2.5))
[1] 0
sum(combined_data$PM2.5<0)
[1] 0
summary(combined_data$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.60 7.70 10.23 12.40 302.50
It appears that there are no missing or implausible values in the combined data set.
Step 5: Daily concentrations of PM2.5 in CA at three different spatial levels in 2002 and 2022
Welch Two Sample t-test
data: data_02$PM2.5 and data_22$PM2.5
t = 66.136, df = 18805, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.337922 7.786156
sample estimates:
mean of x mean of y
16.115943 8.553904
Looking at the summary statistics by year and the bar plots, we can see that the average daily PM2.5 concentration is lower in 2022 than in 2002. However, 2022 has a higher maximum daily PM2.5 concentration (302.5 vs. 104.3 in 2002). This maximum appears as a spike in the 2022 time series plot, in approximately late summer to early fall. The t-test shows that the difference in average daily PM2.5 concentration between the two years is statistically significant (p<0.001).
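To make the Welch test concrete, the sketch below runs `t.test()` on simulated data and recomputes the statistic and the Welch-Satterthwaite degrees of freedom by hand. The vectors `x` and `y` are simulated stand-ins, not the EPA measurements, and the gamma parameters are arbitrary assumptions.

```r
# Simulated daily concentrations (invented; not the EPA measurements).
set.seed(1)
x <- rgamma(200, shape = 2, scale = 8)   # stand-in for 2002
y <- rgamma(600, shape = 2, scale = 4)   # stand-in for 2022

# Welch's t-test is the default of t.test() (var.equal = FALSE).
welch <- t.test(x, y)

# The same statistic and degrees of freedom computed by hand.
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

print(welch$statistic)
print(c(t_manual = t_manual, df_manual = df_manual))
```

Because the two groups have different sizes and variances, the Welch form is a reasonable default here. The non-integer degrees of freedom in the output above come from the same formula.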
library(dplyr)
library(tidyr)
library(ggplot2)

# Filter the data for 2002 and 2022
data_02 <- combined_data[combined_data$year == 2002, ]
data_22 <- combined_data[combined_data$year == 2022, ]

# Calculate average PM2.5 concentrations by county for both years
average_pm_by_county_2002 <- data_02 %>%
  group_by(COUNTY) %>%
  summarize(Average_PM2.5 = mean(PM2.5, na.rm = TRUE))
average_pm_by_county_2022 <- data_22 %>%
  group_by(COUNTY) %>%
  summarize(Average_PM2.5 = mean(PM2.5, na.rm = TRUE))

# Merge the two datasets for comparison (keeps counties present in both years)
comparison_data <- merge(average_pm_by_county_2002, average_pm_by_county_2022,
                         by = "COUNTY", suffixes = c("_2002", "_2022"))

# Reshape the data into long format
comparison_data_long <- pivot_longer(comparison_data,
                                     cols = starts_with("Average_PM2.5"),
                                     names_to = "Year",
                                     values_to = "Average_PM2.5")

# Create the side-by-side bar plot
ggplot(comparison_data_long, aes(x = COUNTY, y = Average_PM2.5, fill = Year)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7), alpha = 0.7, width = 0.7) +
  labs(title = "Average PM2.5 Concentrations by County (2002 vs. 2022)",
       x = "County", y = "Average PM2.5 Concentration") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("blue", "red"), labels = c("2002", "2022")) +
  guides(fill = guide_legend(title = "Year"))
Looking at the maps and bar plot, we can see that average PM2.5 concentrations were generally higher in 2002 than in 2022 across counties.
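The county comparison can also be expressed as a per-county difference, which makes "generally higher" checkable county by county. The sketch below uses base R on invented data. The `toy` data frame, its values, and the `decrease` vector are illustrative assumptions.

```r
# Invented county-level records for two years.
toy <- data.frame(
  COUNTY = c("Alameda", "Alameda", "Butte", "Butte",
             "Alameda", "Alameda", "Butte", "Butte"),
  year   = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022),
  PM2.5  = c(20, 30, 10, 14, 8, 12, 9, 11)
)

# Matrix of mean PM2.5: one row per county, one column per year.
means <- tapply(toy$PM2.5, list(toy$COUNTY, toy$year), mean)

# Positive values mean the county average fell between 2002 and 2022.
decrease <- means[, "2002"] - means[, "2022"]

# Share of counties whose average went down.
share_decreased <- mean(decrease > 0)

print(means)
print(decrease)
print(share_decreased)
```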
County Level
# Filter the data for Los Angeles County
los_angeles_data <- combined_data %>% filter(COUNTY == "Los Angeles")

# Get unique site names in Los Angeles County
unique_site_names <- unique(los_angeles_data$`Site Name`)
unique_site_names
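library(ggplot2)

# Pasadena_data is assumed to have been created in an earlier step that is not shown here.
# Filter the data for Pasadena and the years 2002 and 2022
Pasadena_data_2002 <- Pasadena_data %>% filter(year == 2002)
Pasadena_data_2022 <- Pasadena_data %>% filter(year == 2022)

# Create a bar graph comparing mean PM2.5 in 2002 vs 2022.
# stat = "summary" with fun = "mean" draws the yearly mean; stat = "identity"
# would stack every daily value, making each bar the yearly sum.
ggplot() +
  geom_bar(data = Pasadena_data_2002, aes(x = "2002", y = PM2.5), stat = "summary", fun = "mean", fill = "red", width = 0.5) +
  geom_bar(data = Pasadena_data_2022, aes(x = "2022", y = PM2.5), stat = "summary", fun = "mean", fill = "blue", width = 0.5) +
  labs(title = "Mean PM2.5 Concentration in Pasadena (2002 vs 2022)", x = "Year", y = "Mean PM2.5 Concentration") +
  theme_minimal()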
Welch Two Sample t-test
data: Pasadena_data_2002$PM2.5 and Pasadena_data_2022$PM2.5
t = 10.491, df = 146.06, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
9.087496 13.305989
sample estimates:
mean of x mean of y
20.290909 9.094167
Looking at the summary statistics and bar plot, we can see that the average daily PM2.5 concentration in Pasadena was higher in 2002 than in 2022. The t-test shows that this difference in average daily PM2.5 concentration is statistically significant (p<0.001).
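As a quick consistency check on the two reported t-tests, the difference in sample means should sit at the midpoint of each confidence interval, because a two-sided t-interval is symmetric around the estimate. The sketch below uses only numbers copied from the outputs in this document. The helper `check()` and the list names are illustrative.

```r
# Values copied from the two t-test outputs reported in this document.
state    <- list(mean_2002 = 16.115943, mean_2022 = 8.553904,
                 ci = c(7.337922, 7.786156))
pasadena <- list(mean_2002 = 20.290909, mean_2022 = 9.094167,
                 ci = c(9.087496, 13.305989))

# Compare the difference in means with the midpoint of the interval.
check <- function(res) {
  c(mean_difference = res$mean_2002 - res$mean_2022,
    ci_midpoint     = mean(res$ci))
}

state_check    <- check(state)
pasadena_check <- check(pasadena)

print(round(state_check, 3))
print(round(pasadena_check, 3))
```

The statewide difference is about 7.56 and the Pasadena difference about 11.20, both in the units of the PM2.5 column, and each matches its interval midpoint up to rounding.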